Main goals of the session
The main goal of this practical class is to reproduce a few examples of the application of the HKA (Hudson, Kreitman and Aguadé 1987) and MK (McDonald and Kreitman 1991) methods to test for selective neutrality in the recent past of different genomic regions. Both tests are based on important predictions of the Neutral Theory of Molecular Evolution (Kimura 1983). The HKA test contrasts the levels of nucleotide polymorphism (variability within population) with the levels of nucleotide divergence (fixed substitutions between species) for different loci (at least one of these loci is expected to be evolving under neutrality). All neutral loci across the genome must have the same ratio of polymorphism to divergence. This is because both estimates of nucleotide variation are proportional to the mutation rate. The MK test is based on the same assumption but it compares nucleotide variation at different classes of sites within the same gene (i.e. synonymous -or silent- versus non-synonymous sites) rather than across genes. If all amino acid substitutions between two species are neutral, we expect the same number of non-synonymous and synonymous (or silent) changes, the latter reflecting the neutral mutation rate. In this practice, you will work with polymorphism data from two different Drosophila species, Drosophila melanogaster and Drosophila subobscura, and divergence data from other two drosophilid species, Drosophila simulans and Drosophila guanche (see Fig. 1).
Figure 1. Drosophila species used in this practical lesson and their phylogenetic relationships.
Throughout the document you will see different icons whose meaning is:
: Additional or useful information
: Practical exercise
: Hint to solve an exercise or to do a task
: Slot to answer a question
: Problem or task to be solved
You can use either the R console in the terminal or
RStudio for this practice. If you don’t have R
installed, you can download the appropriate package for your system here. To install
RStudio, go to this page and
follow the instructions.
Before starting the exercises, you will need to install some
necessary libraries for manipulating and analyzing DNA sequence data.
Open the R console in the terminal (or in
RStudio) and type:
# install.packages("ape")
devtools::install_github("pievos101/PopGenome")
install.packages("remotes")
remotes::install_github("andrewparkermorgan/sfsr")
Data files: link and description (click on the file name to access the file) :
rpL32.fas: This file contains the sequence of the rpL32 gene for 18 individuals (one sequence per individual) of a natural population of D. subobscura (J) and one sequence from a closely related species, D. guanche (Dgua). All sequences of D. subobscura are from the same chromosomal arrangement (O3+4). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.
Acph.fas. This file contains the sequence of the Acph gene for 42 individuals (one sequence per individual) of a natural population of D. subobscura (TB) and one sequence of a closely related species, D. guanche (Dgua). The sequences of D. subobscura are from two different chromosomal arrangements (O3+4+8 and O3+4+23). For this practical, use the 18 sequences of D. subobscura as the polymorphism data and the sequence of D. guanche for divergence estimations.
OS.fas. This file contains the sequence of the OS locus of 14 individuals (one sequence per individual) of an European population of D. melanogaster (M) and 2 individuals of D. simulans (S). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.
E9.fas. This file contains the sequence of the E9 locus of 15 individuals (one sequence per individual) of world wide sample of D. melanogaster (MEL_) and 1 individual (one sequence per individual) of D. simulans (SIM_). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.
E10.fas. This file contains the sequence of the E10 locus for 15 individuals (one sequence per individual) of an African population of D. melanogaster (mel-) and one sequence from a closely related species, D. simulans (Dsim). For this practical, use the 15 sequences of D. melanogaster as the polymorphism data and the sequence of D. simulans for divergence estimations.
Under neutrality, the amount of polymorphism in a species should be correlated with the levels of divergence between species for all loci across the genome due to the dependence of both on the neutral mutation rate (Kimura 1983). The HKA test evaluates the fit of the observed polymorphism and divergence data to this specific prediction (Fig. 2).
Figure 2. The HKA test.
Open the R console in the terminal (or
RSsudio), load the library PopGenome, create a
new folder for the gene, copy the fasta file to this folder, and read
the folder with the function readData. This function
creates an object of class GENOME. Then, indicate which
samples are from the population of D. subobscura and of
D.guanche using the function set.populatons.
library(PopGenome)
## create a new working directory for this gene region and copy the fasta file to this folder
dir.create("rpL32")
file.copy(from = "rpL32.fasta",
to="rpL32/rpL32.fasta")
## read the fasta file
rp <- readData("rpL32")
## summary of the data
get.sum.data(rp)
## set populations
get.individuals(rp)
rp <- set.populations(rp,list(
c(get.individuals(rp)[[1]][1:18]),
c(get.individuals(rp)[[1]][19]))
)
rp@region.data@populations
rp@region.data@populations2
rp<-set.outgroup(rp,c("Dgua"))
To know which indices to use in the function
set.populations(), we use the functionget.individuals(). Check the result of the commandsregion.data@populationsandregion.data@populations2to make sure you have set the populations correctly.
First, let’s have a look at the variability in the region we are
studying. The sliding windonw analysis is very useful for this task. To
examine the distribution of the nucleotide polymorphism across the rp32L
region, use the sliding.window.transform() function to
create a new object (rpsw) where regions correspond to a
set of windows into which you divide your region (in the rp
object, regions correspond to the entire gene region). You will set a
window size of 200 bp and a step size of 10 bp. For example, you can
estimate and plot the number of segregating sites across the gene
region:
## divide the region in windows
rpsw <- sliding.window.transform(rp,width=200, jump=10,type=2,whole.data=TRUE)
## calculate segregating sites (and other statistics)
rpsw <- neutrality.stats(rpsw)
## calculate fixed diferences (and other statistics)
rpsw<-calc.fixed.shared(rpsw)
## get segregating sites in D. subobscura
s<-as.data.frame(rpsw@n.segregating.sites[,1])
## get fixed differences between D. subobscura and D. melanogaster
d<-rpsw@n.fixed.sites
## plot the results of the sliding windows analysis
PopGplot(s,span=0.1,xlab="position (rp32L gene region)", ylim=c(min(s,na.rm=TRUE),max(s,na.rm=TRUE)))
lines(d)
legend("topright", legend = c("polymorphism", "divergence"),
col = c("red", "black"), lty = 1)
1. Is the observed variation (visually) in agreement with expectations under the neutral theory of molecular evolution? Why?
2. What would the plot look like if positive selection (=selective sweep) had recently acted on the 3’ half of the gene? and if balancing selection had acted on the 5’ half of the gene??
To make sure you are on the right track, you can apply the HKA test
to both halves of the rpL32 gene region. The hka()
function from the sfsr package can be used to do this.
However, this function only accepts the site frequency spectrum
(SFS) as input for this calculation. Therefore, you must first
obtain the SFS of the regions you wish to compare in the HKA test. Just
to give you an example, in code bellow you can see how to obtain the SFS
and perform the HKA test on the entire rpL32 gene:
library(sfsr)
## extract the sample size
n<-length(rp@populations[[1]])
## get the minor allele frequencies
rp<-detail.stats(rp)
ma<-as.data.frame(rp@region.stats@minor.allele.freqs[1]) ## IMPORTANT: see info section bellow
ma<-ma[1,]
## obtain the unfolded site frequency spectrum in population 1 (including fixed differences with respect to poopulation 2)
ma<-ma*n
sfs<-c()
for (i in 1:n) {
a<-sum(ma==i)
sfs<-c(sfs, a)
}
sfs
monomorphic<-rp@n.valid.sites - sum(sfs)
rp.sfs<-c(monomorphic)
rp.sfs<-c(rp.sfs, sfs)
## plot the SFS
cols <- c("blue", "red")[(sfs >= sfs[n]) + 1 ]
barplot(sfs, main="Site Frequency Spectrum", names.arg=c(1:n), col=cols)
legend("topleft", legend = c("polymorphic", "fixed"), fill=c("blue","red"))
## run the HKA test:
rp.sfs2<-rp.sfs ## IMPORTANT: see info section bellow
hka_rp<-hka_test(rp.sfs, rp.sfs2)
## p-value of the HKA test
hka_rp$p.value
## observed values
hka_rp$observed
## expected values
hka_rp$expected
## residuals
hka_rp$residuals
IMPORTANT: note that in the example above we have compared one SFS against itself, which doesn’t make any sense!!!. In real cases, you must to compare the SFS from two different regions, one of which should be evolving under neutrality. Also note that when you have more than one fasta file in the input folder, the object contains as many regions as fasta files in the folder. Use
get.sum.data()to know the order of each region and use the correct index when extracting minor.allele.freqs.
3. Do the two halves of the rpL32 gene region evolve as expected under the neutral model?
4. Is there any chance that the region being analysed has been subject to positive selection? Maybe even the entire region?
The MK test also compares polymorphism and divergence data but, in this case, between two types of sites within the same gene (coding region), synonymous (neutral class) and nonsynonymous sites (Figure 3). Under the null hypothesis, all nonsynonymous mutations are expected to be neutral and then the ratio of nonsynonymous to synonymous variation within species (Pn/Ps) is expected to equal the ratio of nonsynonymous to synonymous variation between species (Dn/Ds). However, these ratios will not be equal if part of the nonsynonymous variation is under either positive (i.e., mutations that increase individual fitness) or negative selection (i.e., mutations that are negatively selected but not highly deleterious).
Figure 3. The MK test.
We will use the same genomic region to see an example of how the MK
test is applied to real data. The PopGenome package has a
function to run this test in R. As noted above, the test is
based on synonymous and non-synonymous polymorphisms and substitutions,
so the first task is to extract the coding regions from the alignment of
the whole genomic region (the original alignment includes non-coding
regions):
library(ape)
## Create new folder for the cds file in the working directory of this gene
dir.create("rpL32/rpL32_cds")
## create a new alignment only with cds sequences
aln<-read.dna(file="rpL32/rpL32.fasta", format="fasta")
cds<-aln[,c(932:1024,1122:1430)]
write.dna(cds, file="rpL32/rpL32_cds/rpL32_cds.fasta", format="fasta")
## read the cds sequences with PopGenome
rpc<-readData("rpL32/rpL32_cds")
## summary of the data
get.sum.data(rpc)
## set populations
get.individuals(rpc)
rpc <- set.populations(rpc,list(
c(get.individuals(rpc)[[1]][1:18]),
c(get.individuals(rpc)[[1]][19]))
)
rpc<-set.outgroup(rpc,c("Dgua"))
## MK test - fisher test
rpc<-MKT(rpc, do.fisher.test=TRUE)
get.MKT(rpc)
5. What is the result of this test? Can we infer the action of positive selection from the divergence of rpL32 between these two species?
6. How many amino acid changes have occurred in the RpL32 protein since the divergence of D. subobscura and D. guanche? And how many synonymous changes? Discuss these results in relation to the evolutionary rate estimated for this protein in the final assignment of practical 4).
Based on the results of these test, answer the following questions:
7. Are the E9 or E10 gene regions under positive selection in D. melanogaster?
8. If I say that the region referred to E10 is evolving under neutrality (it is a noncoding region with no functional element), does this modify your answer to question 7?
9. Is there evidence of selection in two genes analysed by the MK test? What type of selection (positive or negative)?.
Deliver this document in AULAESCI with your answers
Deadline: June 28, 2024 - 23:59
Doubts? alejandro.sanchez@prof.esci.upf.edu